Abstract

Methods to solve a node discovery problem for a social network are presented. Covert nodes are
nodes which are not directly observable. They transmit influence and affect the resulting collaborative
activities among the persons in a social network, but do not appear in the surveillance logs which record
the participants of the collaborative activities. Discovering the covert nodes means identifying the suspicious
logs where the covert nodes would appear if they became overt. The performance of the methods is
demonstrated with a test dataset generated from computationally synthesized networks and a real organization.
1 Introduction

Covert nodes refer to persons who transmit the influence 
and affect the resulting collaborative activities among 
the persons in a social network, but do not appear in 
the surveillance logs which record the participants of the 
activities. The covert nodes are not observable directly.
It aids us in discovering and approaching the covert
nodes to identify the suspicious surveillance logs where
the covert nodes would appear if they became overt. I
call this problem a node discovery problem for a social 
network. 

Where do we encounter such a problem? Globally 
networked clandestine organizations such as terrorists, 
criminals, or drug smugglers are a great threat to
civilized societies [Sageman (2004)]. Terrorism attacks
cause great economic, social and environmental dam- 
age. Active non-routine responses to the attacks are 
necessary as well as the damage recovery management. 
The short-term target of the responses is the arrest of 
the perpetrators. The long-term target of the responses 
is identifying and dismantling the covert organizational 
foundation which raises, encourages, and helps the per- 
petrators. The threat will be mitigated and eliminated 
by discovering covert leaders and critical conspirators 
of the clandestine organizations. The difficulty of such 
discovery lies in the limited capability of surveillance. 
Information on the leaders and critical conspirators is
missing because it is usually hidden by the organization 
intentionally. 

Let me show an example in the 9/11 terrorist attack 
in 2001 [Krebs (2002)]. Mustafa A. Al-Hisawi, whose
alternate name was Mustafa Al-Hawsawi, was alleged to
be a wire-puller who had acted as a financial manager of
Al Qaeda. He had attempted to help terrorists enter the
United States, and provided the hijackers of the 4
aircraft with financial support worth more than 300,000
dollars. Furthermore, Osama bin Laden is suspected to 
be a wire-puller behind Mustafa A. Al-Hisawi and the 
conspirators behind the hijackers. These persons were 
not recognized as wire-pullers at the time of the attack. 
They were the nodes to discover from the information 
on the collaborative activities of the perpetrators and 
conspirators known at that moment. 

In this paper, I present two methods to solve the
node discovery problem. One is a heuristic method in
[Maeno (2009)], which demonstrates a simulation
experiment of the node discovery problem for the social
network of the 9/11 perpetrators. The other is a
statistical inference method which I propose in this paper.
The method employs the maximal likelihood estimation
and an anomaly detection technique. Section 3 defines
the node discovery problem mathematically. Section 4
presents the two methods. Section 5 introduces
the test dataset generated from computationally
synthesized networks and a real clandestine organization.
Section 6 demonstrates the performance characteristics
of the methods (precision, recall, and van Rijsbergen's
F measure [Korfhuge (1997)]). Section 7 presents the
issues and future perspectives as concluding remarks.
Section 2 summarizes the related works.



2 Related Work 

The social network analysis is a study of social
structures made of nodes which are linked by one or more
specific types of relationship. Examples of the
relationship are influence transmission in communication or
presence of trust in collaboration [Lavrac (2007)].
Network topological characteristics of clandestine terrorist
organizations [Krebs (2002)], [Klerks (2002)] are studied.
Trade-off between staying secret and efficiently securing
coordination and control is of particular interest
[Morselli (2007)]. The impact of the network topology on
the trade-off is analyzed [Lindelauf (2009)].



Research interests have been moving from describing
organizational structure to discovering dynamical
phenomena on a social network. A link discovery
predicts the existence of an unknown link between two
nodes from the information on the known attributes of
the nodes and the known links [Clauset (2008)]. It is
one of the tasks of link mining [Getoor (2005)]. The
link discovery techniques are combined with
domain-specific heuristics. The collaboration between
scientists can be predicted from the published co-authorship
[Liben-Nowell (2004)]. The friendship between people
is inferred from the information available on their web
pages [Adamic (2003)].

Markov random network is a model of the joint
probability distribution of random variables. It is an
undirected graphical model similar to a Bayesian network.
The Markov random network is used to learn the
dependency between the links which share a node. The
Markov random network is one of the dependence graphs
[Frank (1986)], which model the dependency between
links. Extensions to hierarchical models [Lazega (1999)],
multiple networks (treating different types of
relationships) [Pattison (1999)], valued networks (with nodal
attributes) [Robins (1999)], higher order dependency
between the links which share no nodes [Pattison (2002)],
and 2-block chain graphs (associating one set of
explanatory variables with the other set of outcome
variables) [Robins (2001)] are studied. A family of such
extensions and model elaborations is named the
exponential random graph [Anderson (1999)].

In addition to the link discovery, the related
research topics are the exploration of an unknown network
structure [Newman (2007)], the discovery of a
community structure [Palla (2005)], the inference of a network
topology [Rabbat (2008)], the detection of an anomaly
in a network [Silva (2009)], and the discovery of
unknown nodes [Maeno (200^ , [Maeno (2009)].
Stochastic modeling to predict terrorism attacks [Singh (2004)]
is relevant practically. The idea of machine learning of
latent variables [Silva (2006)] is potentially applicable
to discovering an unknown network structure.

3 Problem definition 

The node discovery problem is defined mathematically
in this section. A node represents a person in a
social network. A link represents a relationship which
transmits the influence between persons. The symbols
$n_j$ ($j = 0, 1, \cdots$) represent the nodes. Some nodes are
overt (observable), but the others are covert
(unobservable). $O$ denotes the overt nodes; $\{n_0, n_1, \cdots, n_{N-1}\}$.
Its cardinality is $|O| = N$. $C = \overline{O}$ denotes the covert
nodes; $\{n_N, n_{N+1}, \cdots, n_{M-1}\}$. Its cardinality is $|C| = M - N$.
The whole set of nodes in a social network is $O \cup C$.
The number of the nodes is $M$. The unobservability of
the covert nodes arises either from a technical defect of
surveillance means or an intentional cover-up operation.

The symbol $\delta_i$ represents a set of participants in a
particular collaborative activity. It is the $i$-th activity
pattern among the nodes. A pattern $\delta_i$ is a set of nodes;
$\delta_i$ is a subset of $O \cup C$. For example, the nodes in
a collaborative activity pattern are those who joined
a particular conference call. That is, a pattern is a
co-occurrence among the nodes [Rabbat (2008)]. The
unobservability of the covert nodes does not affect the
activity patterns themselves.

A simple hub-and-spoke model is assumed as a model
of the influence transmission over the links which results in the
collaborative activities among the nodes. The way
the influence is transmitted governs the set of
possible activity patterns $\{\delta_i\}$. The network topology and
the influence transmission are described by some
probability parameters. The probability where the
influence transmits from an initiating node $n_j$ to a
responder node $n_k$ is $r_{jk}$. The influence transmits to multiple
responders independently in parallel. It is similar to
the degree of collaboration probability in trust
modeling [Lavrac (2007)]. The constraints are $0 \le r_{jk}$ and
$\sum_{k \neq j} r_{jk} \le 1$. The quantity $f_j$ is the probability where
the node $n_j$ becomes an initiator. The constraints are
$0 \le f_j$ and $\sum_{j=0}^{M-1} f_j = 1$. These parameters are defined
for the whole nodes in a social network (both the nodes
in $O$ and $C$).

A surveillance log $d_i$ records a set of the overt nodes
in a collaborative activity pattern $\delta_i$. It is given by
eq.(1). A log $d_i$ is a subset of $O$. The number of data
is $D$. A set $\{d_i\}$ is the whole surveillance logs dataset.

$$d_i = \delta_i \cap O \quad (0 \le i < D) \qquad (1)$$



Note that neither an individual node nor a single
link alone can be observed directly, but nodes can be
observed collectively as a collaborative activity pattern.
The dataset $\{d_i\}$ can be expressed by a 2-dimensional
$D \times N$ matrix of binary variables $d$. The presence or
absence of the node $n_j$ in the data $d_i$ is indicated by the
elements in eq.(2).

$$d_{ij} = \begin{cases} 1 & \text{if } n_j \in d_i \\ 0 & \text{otherwise} \end{cases} \quad (0 \le i < D,\ 0 \le j < N) \qquad (2)$$



Solving the node discovery problem means identifying
all the surveillance logs where covert nodes would
appear if they became overt. In other words, it means
identifying the logs for which $d_i \neq \delta_i$ holds because
of the covert nodes belonging to $C$.
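
The definitions in eq.(1) and eq.(2) can be illustrated with a minimal Python sketch. The toy patterns, the node numbering, and the function names `make_logs` and `log_matrix` are hypothetical and serve only as an illustration; they are not taken from the experiments in this paper.

```python
import numpy as np

def make_logs(patterns, overt):
    """Eq.(1): a log keeps only the overt members of a pattern."""
    overt = set(overt)
    return [set(p) & overt for p in patterns]

def log_matrix(logs, n_overt):
    """Eq.(2): D x N binary matrix, d[i, j] = 1 iff overt node j is in log i."""
    d = np.zeros((len(logs), n_overt), dtype=int)
    for i, log in enumerate(logs):
        for j in log:
            d[i, j] = 1
    return d

# Hypothetical toy case: nodes 0..3 are overt (N = 4), node 4 is covert (M = 5).
patterns = [{0, 1, 4}, {1, 2}, {2, 3, 4}, {0, 3}]
logs = make_logs(patterns, overt=range(4))
d = log_matrix(logs, n_overt=4)

# The answer to the node discovery problem: indices of logs that differ
# from their underlying patterns because a covert node was removed.
targets = [i for i, (log, p) in enumerate(zip(logs, patterns)) if log != p]
```

In this toy case the logs with index 0 and 2 are the ones to identify, because the covert node 4 took part in the underlying patterns.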

4 Solution 

4.1 Heuristic method 

A heuristic method to solve the node discovery problem
is studied in [Maeno (2009)]. The method is reviewed
briefly.

At first, every node which appears in the dataset
$\{d_i\}$ is classified into one of the clusters $c_l$ ($0 \le l < C$).
The number of clusters is $C$, which depends on the prior
knowledge. Mutually close nodes form a cluster. The
measure of closeness between a pair of nodes is evaluated
by the Jaccard's coefficient [Liben-Nowell (2004)]. It is
used widely in link discovery, web mining, or text
processing. The Jaccard's coefficient between the nodes $n$
and $n'$ is defined by eq.(3). The function $B(s)$ in eq.(3)
is a Boolean function which returns 1 if the proposition
$s$ is true, or 0 otherwise. The operators $\wedge$ and $\vee$ are
logical AND and OR.

$$J(n, n') = \frac{\sum_{i=0}^{D-1} B(n \in d_i \wedge n' \in d_i)}{\sum_{i=0}^{D-1} B(n \in d_i \vee n' \in d_i)} \qquad (3)$$
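
As an illustration, the Jaccard's coefficient can be computed from the binary log matrix as the number of logs containing both nodes divided by the number of logs containing either node. The following Python sketch assumes that form; the function name `jaccard`, the toy matrix, and the convention of returning 0 when neither node appears in any log are assumptions made here.

```python
import numpy as np

def jaccard(d, a, b):
    """Jaccard's coefficient between nodes a and b over the logs in d."""
    in_a = d[:, a] == 1
    in_b = d[:, b] == 1
    both = np.sum(in_a & in_b)      # logs containing both nodes (AND)
    either = np.sum(in_a | in_b)    # logs containing at least one node (OR)
    # Assumed convention: coefficient is 0 if neither node is ever observed.
    return both / either if either > 0 else 0.0

# Hypothetical 4 logs over 3 overt nodes.
d = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]])
j01 = jaccard(d, 0, 1)   # nodes 0 and 1 co-occur in 2 of the 3 logs with either
j02 = jaccard(d, 0, 2)   # nodes 0 and 2 co-occur in 1 of the 4 logs with either
j12 = jaccard(d, 1, 2)   # nodes 1 and 2 never co-occur
```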



The k-medoids clustering algorithm [Hastie (2001)]
is employed for classification of the nodes. It is an EM
(expectation-maximization) algorithm similar to the
k-means algorithm for numerical data. A medoid node
locates most centrally within a cluster. It corresponds
to the center of gravity in the k-means algorithm. The
clusters and the medoid nodes are re-calculated
iteratively until they converge into a stable structure. The
k-medoids clustering algorithm may be substituted by
other clustering algorithms such as hierarchical
clustering or self-organizing mapping.

Then, suspiciousness of every surveillance log $d_i$ as a
candidate where the covert nodes would appear is
evaluated with a ranking function $s(d_i)$. The ranking
function returns a higher value for a more suspicious log. The
strength of the correlation between the log $d_i$ and the
cluster $c_l$ is defined by $w(d_i, c_l)$ in eq.(4) as a
preparation.

$$w(d_i, c_l) = \max_{n_j \in c_l} B(n_j \in d_i) \qquad (4)$$
The ranking function takes $w(d_i, c_l)$ as an input.
Various forms of ranking functions can be constructed.
For example, [Maeno (2009)] studied a simple form in
eq.(5) where the function $u(x)$ returns 1 if the real
variable $x$ is positive, or 0 otherwise.

$$s(d_i) \propto \sum_{l=0}^{C-1} u(w(d_i, c_l)) = \sum_{l=0}^{C-1} B(d_i \cap c_l \neq \emptyset) \qquad (5)$$



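The $i$-th most suspicious log is given by $d_{\sigma(i)}$ where
$\sigma(i)$ is calculated by eq.(6). Suspiciousness $s(d_{\sigma(i)})$ is
always at least as large as $s(d_{\sigma(i')})$ for any $i < i'$.

$$\sigma(i) = \arg\max_{m \neq \sigma(n) \text{ for } \forall n < i} s(d_m) \quad (1 \le i \le D) \qquad (6)$$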
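
The ranking step can be sketched in Python as follows, assuming that the cluster of every overt node is already known (for example from a k-medoids run) and that the score of a log is the number of clusters with at least one member in the log. The function names, the toy data, and the tie-breaking by log index are assumptions made for this illustration. Indices start from 0 in the code, while eq.(6) counts from 1.

```python
import numpy as np

def suspiciousness(d, clusters):
    """Score of each log: number of clusters having a member in the log."""
    n_clusters = max(clusters) + 1
    member = np.zeros((n_clusters, d.shape[1]), dtype=int)
    for j, c in enumerate(clusters):
        member[c, j] = 1
    # hit[i, l] is True iff log i contains at least one node of cluster l.
    hit = (d @ member.T) > 0
    return hit.sum(axis=1)

def ranking(scores):
    """Log indices from most to least suspicious (ties broken by index)."""
    return np.argsort(-np.asarray(scores), kind="stable")

# Hypothetical example: 5 overt nodes in 3 clusters, 4 logs.
clusters = [0, 0, 1, 1, 2]
d = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 0, 1, 0]])
scores = suspiciousness(d, clusters)
sigma = ranking(scores)
```

Here the log with index 1 spans all three clusters and is ranked first.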

The computational burden of the method remains
light as the number of nodes and surveillance logs
increases. The method is expected to work well for
clustered networks, and moderately well even if the
topological and stochastic mechanism which generates the
surveillance logs is not well understood. The method
works without the knowledge about the hub-and-spoke
model; the parametric form with $r_{jk}$ and $f_j$ in Section 3.
The result, however, cannot be very accurate because
of the heuristic nature. A statistical inference method
which requires heavy computational burden, but
outputs more accurate results is presented next.

4.2 Statistical inference method 

The statistical inference method employs the maximal 
likelihood estimation to infer the topology of the net- 
work, and applies an anomaly detection technique to 
retrieve the suspicious surveillance logs which are not 
likely to realize without the covert nodes. The maximal 
likelihood estimation is a basic statistical method used 
for fitting a statistical model to data and for providing 
estimates for the model's parameters. The anomaly de- 
tection refers to detecting patterns in a given dataset 
that do not conform to an established normal behavior. 

A single symbol $\theta$ represents both of the
parameters $r_{jk}$ and $f_j$ for the nodes in $O$. $\theta$ is the target
variable, the value of which needs to be inferred from
the surveillance log dataset. The logarithmic likelihood
function [Hastie (2001)] is defined by eq.(7). The
quantity $p(\{d_i\}|\theta)$ denotes the probability where the
surveillance log dataset $\{d_i\}$ realizes under a given $\theta$.

$$L(\theta) = \log(p(\{d_i\}|\theta)) \qquad (7)$$



The individual surveillance logs are assumed to be
independent, so eq.(7) becomes eq.(8).

$$L(\theta) = \log\Big(\prod_{i=0}^{D-1} p(d_i|\theta)\Big) = \sum_{i=0}^{D-1} \log(p(d_i|\theta)) \qquad (8)$$



The quantity $q_{i|jk}$ in eq.(9) is the probability where
the presence or absence of the node $n_k$ as a responder to
the stimulating node $n_j$ coincides with the surveillance
log $d_i$.

$$q_{i|jk} = \begin{cases} r_{jk} & \text{if } d_{ik} = 1 \text{ for given } i \text{ and } j \\ 1 - r_{jk} & \text{otherwise} \end{cases} \qquad (9)$$



Eq.(9) is equivalent to eq.(10) since the value of $d_{ik}$
is either 0 or 1.

$$q_{i|jk} = d_{ik} r_{jk} + (1 - d_{ik})(1 - r_{jk}) \qquad (10)$$



The probability $p(d_i|\theta)$ in eq.(8) is expressed by
eq.(11).

$$p(d_i|\theta) = \sum_{j=0}^{N-1} d_{ij} f_j \prod_{0 \le k < N \wedge k \neq j} q_{i|jk} \qquad (11)$$



The logarithmic likelihood function takes an explicit
formula in eq.(12). The case $k = j$ in multiplication
($\prod_k$) is included since $d_{ik}^2 = d_{ik}$ always holds.

$$L(\theta) = \sum_{i=0}^{D-1} \log\Big(\sum_{j=0}^{N-1} d_{ij} f_j \prod_{k=0}^{N-1} \{1 - d_{ik} + (2 d_{ik} - 1) r_{jk}\}\Big) \qquad (12)$$
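
The logarithmic likelihood in eq.(12) can be evaluated with a few lines of NumPy. The sketch below is an illustration only: the function names and the parameter values are hypothetical, and the diagonal elements $r_{jj}$ are set to 1 as an assumption so that the factor for $k = j$ does not change the product when the initiator $n_j$ is present in the log.

```python
import numpy as np

def pattern_probs(d, r, f):
    """p(d_i | theta) for every log: sum over initiators j of
    d_ij * f_j * prod_k {1 - d_ik + (2 d_ik - 1) r_jk}."""
    # q[i, j, k] = 1 - d_ik + (2 d_ik - 1) r_jk
    q = 1 - d[:, None, :] + (2 * d[:, None, :] - 1) * r[None, :, :]
    prod_k = q.prod(axis=2)                 # shape (D, N), one value per (i, j)
    return (d * f[None, :] * prod_k).sum(axis=1)

def log_likelihood(d, r, f):
    """L(theta): sum of log p(d_i | theta) over independent logs."""
    return np.log(pattern_probs(d, r, f)).sum()

# Hypothetical parameters for 3 overt nodes; r_jj = 1 is an assumed convention.
r = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
f = np.array([1 / 3, 1 / 3, 1 / 3])
d = np.array([[1, 1, 0]])                   # one log containing nodes 0 and 1
p = pattern_probs(d, r, f)
L = log_likelihood(d, r, f)
```

For this single log, initiator 0 contributes $0.5 \times 0.8$ and initiator 1 contributes $0.5 \times 0.7$, so the probability is $(0.4 + 0.35)/3 = 0.25$.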

The maximal likelihood estimator $\hat{\theta}$ is obtained by
solving eq.(13). It gives the values of the parameters
$r_{jk}$ and $f_j$. A pair of nodes $n_j$ and $n_k$ for which $r_{jk} > 0$
possesses a link between them.

$$\hat{\theta} = \arg\max_{\theta} L(\theta) \qquad (13)$$



A simple incremental optimization technique, the
hill climbing method (or the method of steepest descent),
is employed to solve eq.(13). Non-deterministic
methods such as simulated annealing [Hastie (2001)] can be
employed to strengthen the search ability and to avoid
sub-optimal solutions. These methods search for better
parameter values around the present values and
update them as in eq.(14) until the values converge.

$$r_{jk} \to r_{jk} + \Delta r_{jk}, \quad f_j \to f_j + \Delta f_j \quad (0 \le j, k < N) \qquad (14)$$



The change in the logarithmic likelihood function
can be calculated as a product of the derivatives
(differential coefficients with regard to $r$ and $f$) and the
amount of the updates in eq.(15). The updates $\Delta r_{nm}$
and $\Delta f_n$ should be in the direction of the steepest ascent
in the landscape of the logarithmic likelihood function.

$$\Delta L(\theta) = \sum_{n,m=0}^{N-1} \frac{\partial L(\theta)}{\partial r_{nm}} \Delta r_{nm} + \sum_{n=0}^{N-1} \frac{\partial L(\theta)}{\partial f_n} \Delta f_n \qquad (15)$$
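The derivatives with regard to $r$ are given by eq.(16).

$$\frac{\partial L(\theta)}{\partial r_{nm}} = \sum_{i=0}^{D-1} \left[ \frac{f_n d_{in} (2 d_{im} - 1) \prod_{0 \le k < N \wedge k \neq m} \{1 - d_{ik} + (2 d_{ik} - 1) r_{nk}\}}{\sum_{j=0}^{N-1} d_{ij} f_j \prod_{k=0}^{N-1} \{1 - d_{ik} + (2 d_{ik} - 1) r_{jk}\}} \right] \qquad (16)$$

The derivatives with regard to $f$ are given by eq.(17).

$$\frac{\partial L(\theta)}{\partial f_n} = \sum_{i=0}^{D-1} \left[ \frac{d_{in} \prod_{k=0}^{N-1} \{1 - d_{ik} + (2 d_{ik} - 1) r_{nk}\}}{\sum_{j=0}^{N-1} d_{ij} f_j \prod_{k=0}^{N-1} \{1 - d_{ik} + (2 d_{ik} - 1) r_{jk}\}} \right] \qquad (17)$$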
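
A hedged sketch of one hill climbing update is given below. It implements the derivatives with regard to $r$ and $f$ as derived from the explicit logarithmic likelihood, and moves $r$ a small step along its gradient. The function names, the step size `rate`, the clipping of $r_{jk}$ to the range from 0 to 1, and the convention of holding $r_{jj}$ at 1 are assumptions made for this illustration; the update of $f$ and the convergence test are omitted.

```python
import numpy as np

def q_factors(d, r):
    # q[i, j, k] = 1 - d_ik + (2 d_ik - 1) r_jk
    return 1 - d[:, None, :] + (2 * d[:, None, :] - 1) * r[None, :, :]

def log_likelihood(d, r, f):
    p = (d * f * q_factors(d, r).prod(axis=2)).sum(axis=1)
    return np.log(p).sum()

def gradients(d, r, f):
    """Derivatives of the logarithmic likelihood w.r.t. r_nm and f_n."""
    q = q_factors(d, r)
    n_nodes = d.shape[1]
    prod_all = q.prod(axis=2)                    # product over all k, (D, N)
    p = (d * f * prod_all).sum(axis=1)           # p(d_i | theta), (D,)
    grad_f = (d * prod_all / p[:, None]).sum(axis=0)
    grad_r = np.zeros((n_nodes, n_nodes))
    for n in range(n_nodes):
        for m in range(n_nodes):
            # product over k != m for initiator n
            others = np.delete(q[:, n, :], m, axis=1).prod(axis=1)
            grad_r[n, m] = (f[n] * d[:, n] * (2 * d[:, m] - 1) * others / p).sum()
    return grad_r, grad_f

def hill_climb_step(d, r, f, rate=1e-4):
    """One steepest-ascent update of r (f is kept fixed in this sketch)."""
    grad_r, _ = gradients(d, r, f)
    np.fill_diagonal(grad_r, 0.0)                # assumed: r_jj stays at 1
    return np.clip(r + rate * grad_r, 0.0, 1.0)

# Hypothetical logs and starting parameters for 3 overt nodes.
d = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
r = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
f = np.array([0.3, 0.3, 0.4])
grad_r, grad_f = gradients(d, r, f)
r_next = hill_climb_step(d, r, f)
```

Repeating the step until the likelihood stops improving gives a local maximum; a non-deterministic search would be needed to escape it.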

The ranking function $s(d_i)$ is the inverse of the
probability at which $d_i$ realizes under the maximal likelihood
estimator $\hat{\theta}$. According to the anomaly detection
technique, it gives a higher return value to the suspicious
surveillance logs which are less likely to realize
without the covert nodes. The ranking function is given by
eq.(18).

$$s(d_i) = \frac{1}{p(d_i|\hat{\theta})} \qquad (18)$$

The $i$-th most suspicious log is given by $d_{\sigma(i)}$ by the
same formula in eq.(6).
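
The anomaly detection step can be sketched as follows: each log is scored by the inverse of its probability under the estimated parameters, and the logs are sorted by the score. The parameter values below are hypothetical stand-ins for a fitted estimator, and the function names and the convention $r_{jj} = 1$ are assumptions for illustration.

```python
import numpy as np

def pattern_probs(d, r, f):
    """Probability of each log under the hub-and-spoke parameters r and f."""
    q = 1 - d[:, None, :] + (2 * d[:, None, :] - 1) * r[None, :, :]
    return (d * f[None, :] * q.prod(axis=2)).sum(axis=1)

def rank_by_anomaly(d, r_hat, f_hat):
    """Log indices from most to least suspicious, with score 1 / p."""
    scores = 1.0 / pattern_probs(d, r_hat, f_hat)
    return np.argsort(-scores, kind="stable"), scores

# Hypothetical estimate: nodes 0 and 1 are strongly linked, node 2 is weakly
# linked to both. r_jj = 1 is an assumed convention.
r_hat = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
f_hat = np.array([1 / 3, 1 / 3, 1 / 3])
d = np.array([[1, 1, 0],    # expected co-occurrence
              [1, 0, 1],    # unlikely under the estimated links
              [0, 0, 1]])   # node 2 acting alone
order, scores = rank_by_anomaly(d, r_hat, f_hat)
```

The log containing nodes 0 and 2 without node 1 has the lowest probability under these parameters and is therefore ranked as the most suspicious.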



5 Test Dataset 
5.1 Network 

Two classes of networks are employed to generate a test 
dataset for performance evaluation of the two methods. 
The first class is computationally synthesized networks. 
The second class is a real clandestine organization. 

The networks [A] in Figure 1 and [B] in Figure 2 are
synthesized computationally. They are based on the
Barabasi-Albert model [Barabasi (1999)] with a
cluster structure. The Barabasi-Albert model grows with
preferential attachment. The probability where a newly
coming node $n_k$ connects a link to an existing node $n_j$
is proportional to the nodal degree of $n_j$ ($p(k \to j) \propto K(n_j)$).
The occurrence frequency of the nodal degree
tends to be scale-free ($F(K) \propto K^{\alpha}$). In the
Barabasi-Albert model with a cluster structure, every node $n_j$
is assigned a pre-determined cluster attribute $c(n_j)$ to
which it belongs. The number of clusters is $C$. The
probability $p(k \to j)$ is modified to eq.(19). A cluster
contrast parameter $\eta$ is introduced. Links between the
clusters appear less frequently as $\eta$ increases. The
initial links between the clusters are connected at random
before growth by preferential attachment starts.

$$p(k \to j) \propto \begin{cases} \cdots & \text{if } c(n_j) = c(n_k) \\ \cdots & \text{otherwise} \end{cases} \qquad (19)$$

Figure 1: Computationally synthesized network [A]
which consists of 101 nodes and 5 clusters. Cluster
contrast parameter is $\eta = 50$. The network is relatively
more clustered. The node $n_{12}$ is a typical hub node.
The node $n_{75}$ is a typical peripheral node.

Figure 2: Computationally synthesized network [B]
which consists of 101 nodes and 5 clusters. Cluster
contrast parameter is $\eta = 2.5$. The network is relatively
less clustered. The node $n_{12}$ is a typical hub node. The
node $n_{48}$ is a typical peripheral node.



Hub nodes are those which have a nodal degree larger
than the average. The node $n_{12}$ in the network [A] in
Figure 1 is a typical hub node. Peripheral nodes are
those which have a nodal degree smaller than the
average. The node $n_{75}$ in the network [A] in Figure 1 is a
typical peripheral node.

The network in Figure 3 represents a real
clandestine organization. It is a global mujahedin organization
which was analyzed in [Sageman (2004)]. The
mujahedin in the global Salafi jihad means Muslim fighters
in Salafism (Sunni Islamic school of thought) who
struggle to establish justice on earth. Note that jihad does
not necessarily refer to military exertion. The
organization consists of 107 persons and 4 regional sub-networks.
The sub-networks represent Central Staffs ($n_{CSj}$)
including the node $n_{ObL}$, Core Arabs ($n_{CAj}$) from the
Arabian Peninsula countries and Egypt, Maghreb Arabs
($n_{MAj}$) from the North African countries, and Southeast
Asians ($n_{SAj}$). The network topology is not simply
hierarchical. The 4 regional sub-networks are connected
mutually in a complex manner.

The node representing Osama bin Laden, $n_{ObL}$, is a
hub ($K(n_{ObL}) = 8$). He is believed to be the founder
of the organization, and said to be the covert leader 
who provides operational commanders in regional sub- 
networks with financial support in many terrorism at- 
tacks including 9/11 in 2001. His whereabouts are not 
known despite many efforts in investigation and cap- 
ture. 

The topological characteristics of the above
mentioned networks are summarized in Table 1. The global
mujahedin organization has a relatively large Gini
coefficient of the nodal degree; $G = 0.35$ and a
relatively large average clustering coefficient [Watts (1998)];
$\langle W(n_j) \rangle = 0.54$. In economics, the Gini coefficient is a
measure of inequality of income distribution or of wealth
distribution. A larger Gini coefficient indicates lower
equality. The values mean that the organization
possesses hubs and a cluster structure. The values also
indicate that the computationally synthesized network
[A] is more clustered and close to the global mujahedin
organization while the network [B] is less clustered.
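
For illustration, a Gini coefficient of the nodal degree can be computed as below. The paper does not state which formula is used, so the sketch assumes the common definition based on the mean absolute difference between all pairs of values, divided by twice the mean. The function name and the degree sequences are hypothetical.

```python
import numpy as np

def gini(values):
    """Gini coefficient: sum_ij |x_i - x_j| / (2 n^2 mean(x)).
    This definition is an assumption; the source gives no formula."""
    x = np.asarray(values, dtype=float)
    diff_sum = np.abs(x[:, None] - x[None, :]).sum()
    return diff_sum / (2 * x.size ** 2 * x.mean())

equal_degrees = gini([2, 2, 2, 2])     # every node has the same degree
one_hub = gini([0, 0, 0, 4])           # all links concentrate on one node
```

A network in which all nodes have the same degree gives 0, and the value grows as the links concentrate on a few hubs.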

5.2 Test Dataset 

The test dataset $\{d_i\}$ is generated from each network in
Section 5.1 in the 2 steps below.

In the first step, the collaborative activity patterns
$\{\delta_i\}$ are generated $D$ times according to the influence
transmission under the true value of $\theta$. A pattern
includes both an initiator node $n_j$ and multiple
responder nodes $n_k$. An example is $\delta_{ex1} = \{n_{CS1}, n_{CS2}, n_{CS6},
n_{CS7}, n_{CS9}, n_{ObL}, n_{CS11}, n_{CS12}, n_{CS14}\}$ for the global
mujahedin organization in Figure 3.

Figure 3: Social network representing a global
mujahedin (Jihad fighters) organization [Sageman (2004)],
which consists of 107 nodes and 4 regional sub-networks.
The sub-networks represent Central Staffs ($n_{CSj}$)
including the node $n_{ObL}$, Core Arabs ($n_{CAj}$), Maghreb
Arabs ($n_{MAj}$), and Southeast Asians ($n_{SAj}$). The node
$n_{ObL}$ is Osama bin Laden who many believe is the
founder of the organization.

Table 1: The number of nodes $M$, the number of clusters
$C$, the cluster contrast parameter $\eta$, the average degree
$\langle K(n_j) \rangle$, the average clustering coefficient
$\langle W(n_j) \rangle$, and the Gini coefficient $G$ of the
computationally synthesized networks (CSN) [A] and
[B], and the global mujahedin organization (GMO).

Model      M     C    $\eta$   $\langle K \rangle$   $\langle W \rangle$   G
CSN [A]    101   5    50       3.6                   0.42                  0.36
CSN [B]    101   5    2.5      3.9                   0.22                  0.37
GMO        107   -    -        5.1                   0.54                  0.35

In the second step, the surveillance log dataset $\{d_i\}$
is generated by deleting the covert nodes belonging to $C$
from the patterns $\{\delta_i\}$. The example $\delta_{ex1}$ results in the
surveillance log $d_{ex1} = \delta_{ex1} \cap O = \{n_{CS1}, n_{CS2}, n_{CS6},
n_{CS7}, n_{CS9}, n_{CS11}, n_{CS12}, n_{CS14}\}$ if Osama bin Laden
is a covert node; $C = \{n_{ObL}\}$. The covert node in $C$
may appear multiple times in the collaborative activity
patterns $\{\delta_i\}$. The number of the target logs to identify
$D_t$ is given by eq.(20).

$$D_t = \sum_{i=0}^{D-1} B(d_i \neq \delta_i) \qquad (20)$$



In the performance evaluation in Section 6, a few
assumptions are made for simplicity. The probability
$f_j$ does not depend on the nodes ($f_j = 1/M$). The
value of the probability $r_{jk}$ is either 1 when a link is
present between nodes, or 0 otherwise. It means that
the number of the possible collaborative activity
patterns is bounded. The influence transmission is
symmetrically bi-directional; $r_{jk} = r_{kj}$.
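
Under these simplifying assumptions, a pattern is fully determined by its initiator: it consists of the initiator and all of its neighbors. The two generation steps can be sketched in Python as below. The toy adjacency list, the function names, and the random seed are hypothetical and are not the networks used in the evaluation.

```python
import random

def generate_patterns(adjacency, n_logs, seed=0):
    """Step 1: draw an initiator uniformly (f_j = 1/M) and let the influence
    reach every neighbor (r_jk = 1 on a link, 0 otherwise)."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    patterns = []
    for _ in range(n_logs):
        j = rng.choice(nodes)
        patterns.append({j} | set(adjacency[j]))
    return patterns

def observe(patterns, covert):
    """Step 2: delete the covert nodes and count the target logs D_t."""
    covert = set(covert)
    logs = [p - covert for p in patterns]
    n_targets = sum(1 for p, log in zip(patterns, logs) if p != log)
    return logs, n_targets

# Hypothetical symmetric network of 4 nodes; node 0 is treated as covert.
adjacency = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
patterns = generate_patterns(adjacency, n_logs=20)
logs, n_targets = observe(patterns, covert={0})
```

Because a node has only one possible pattern as an initiator, the number of distinct patterns cannot exceed the number of nodes, which is the boundedness noted above.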

6 Performance 

6.1 Performance measure 

Three measures, precision, recall, and van Rijsbergen's
F measure [Korfhuge (1997)], are used to evaluate the
performance of the methods. They are commonly used
in information retrieval such as search, document
classification, and query classification. The precision $p$ is
the fraction of the number of the relevant data retrieved
by search to the number of all the data
retrieved by search. The recall $r$ is the fraction of the
number of the relevant data retrieved by search to the
number of all the relevant data. The relevant data refers
to the data where $d_i \neq \delta_i$. They are given by eq.(21)
and eq.(22). They are functions of the number of the
retrieved data $D_r$. It can take the value from 1 to $D$. The
data is retrieved in the order of $d_{\sigma(1)}$, $d_{\sigma(2)}$, to $d_{\sigma(D_r)}$.

$$p(D_r) = \frac{\sum_{i=1}^{D_r} B(d_{\sigma(i)} \neq \delta_{\sigma(i)})}{D_r} \qquad (21)$$

$$r(D_r) = \frac{\sum_{i=1}^{D_r} B(d_{\sigma(i)} \neq \delta_{\sigma(i)})}{D_t} \qquad (22)$$

The F measure $F$ is the harmonic mean of the
precision and recall. It is given by eq.(23).

$$F(D_r) = \frac{2 p(D_r) r(D_r)}{p(D_r) + r(D_r)} \qquad (23)$$

The precision, recall, and F measure range from 0
to 1. All the measures take larger values as the
performance of retrieval becomes better.
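
The three measures can be computed for a given retrieval depth as in the following Python sketch. The function name, the toy ranking, and the convention of returning an F measure of 0 when no relevant log has been retrieved are assumptions made here.

```python
def precision_recall_f(order, relevant, n_retrieved):
    """Precision, recall and F measure after retrieving the first
    n_retrieved logs of the ranking `order`."""
    hits = sum(1 for i in order[:n_retrieved] if i in relevant)
    p = hits / n_retrieved
    r = hits / len(relevant)
    # Assumed convention: F = 0 when nothing relevant has been retrieved.
    f = 2 * p * r / (p + r) if hits > 0 else 0.0
    return p, r, f

# Hypothetical ranking of 4 logs; logs 0 and 1 are the relevant ones.
order = [1, 3, 0, 2]
relevant = {0, 1}
at_2 = precision_recall_f(order, relevant, 2)
at_3 = precision_recall_f(order, relevant, 3)
```

At a depth of 2 only one of the two relevant logs is retrieved, so all three measures are 0.5. At a depth of 3 the recall reaches 1 while the precision is 2/3.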

6.2 Comparison 

The performance of the heuristic method and statistical 
inference method is compared with the test dataset gen- 
erated from the computationally synthesized networks. 

Figure 4 shows the precision $p(D_r)$ as a function of
the rate of the retrieved data to the whole data $D_r/D$
in case the hub node $n_{12}$ in the computationally
synthesized network [A] in Figure 1 is the target covert
node to discover, $C = \{n_{12}\}$. The three graphs are
for [a] the statistical inference method, [b] the heuristic
method ($C = 5$), and [c] the heuristic method ($C = 10$).
The number of the surveillance logs in a test dataset is
$D = 100$. The broken lines indicate the theoretical limit
(the upper bound) and the random retrieval (the lower
bound). The vertical solid line indicates the position
where $D_r = D_t$. Figure 5 shows the recall $r(D_r)$ as a
function of the rate $D_r/D$. Figure 6 shows the F
measure $F(D_r)$ as a function of the rate $D_r/D$. The
experimental conditions are the same as those for Figure 4.
The performance of the heuristic method is moderately
good if the number of clusters is known as prior
knowledge. Otherwise, the performance would be worse. On
the other hand, the statistical inference method
surpasses the heuristic method and approaches the
theoretical limit.

Figure 7 shows the F measure $F(D_r)$ as a function of
the rate $D_r/D$ in case the hub node $n_{12}$ in the network
[B] in Figure 2 is the target covert node to discover. The
two graphs are for [a] the statistical inference method
and [b] the heuristic method ($C = 5$). The performance
of the statistical inference method is still good while
that of the heuristic method becomes worse in a less
clustered network.

Figure 8 shows the F measure $F(D_r)$ as a function
of the rate $D_r/D$ in case the peripheral node $n_{75}$ in
the network [A] in Figure 1 is the target covert node
to discover. Figure 9 shows the F measure $F(D_r)$ as a
function of the rate $D_r/D$ when the peripheral node $n_{43}$
in the network [B] in Figure 2 is the target covert node
to discover. The statistical inference method works fine
while the heuristic method fails.



Figure 4: Precision $p(D_r)$ as a function of the rate of the
retrieved data to the whole data $D_r/D$ in case the hub
node $n_{12}$ in the computationally synthesized network
[A] in Figure 1 is the target covert node to discover.
$C = \{n_{12}\}$. $|C| = 1$. $|O| = 100$. $D = 100$. The
three graphs are for [a] the statistical inference method,
[b] the heuristic method ($C = 5$), and [c] the heuristic
method ($C = 10$). The broken lines indicate the
theoretical limit (the upper bound) and the random retrieval
(the lower bound). The vertical solid line indicates the
position where $D_r = D_t$.

Figure 5: Recall $r(D_r)$ as a function of the rate $D_r/D$.
The experimental conditions are the same as those for
Figure 4.

Figure 6: F measure $F(D_r)$ as a function of the rate
$D_r/D$. The experimental conditions are the same as
those for Figure 4.

Figure 7: F measure $F(D_r)$ as a function of the rate
$D_r/D$ in case the hub node $n_{12}$ in the computationally
synthesized network [B] in Figure 2 is the target covert
node to discover. Two graphs are for [a] the statistical
inference method, and [b] the heuristic method ($C = 5$).

Figure 8: F measure $F(D_r)$ as a function of the rate
$D_r/D$ in case the peripheral node $n_{75}$ in the
computationally synthesized network [A] in Figure 1 is the target
covert node to discover. Two graphs are for [a] the
statistical inference method, and [b] the heuristic method
($C = 5$).

Figure 9: F measure $F(D_r)$ as a function of the rate
$D_r/D$ when the peripheral node $n_{43}$ in the
computationally synthesized network [B] in Figure 2 is the target
covert node to discover. Two graphs are for [a] the
statistical inference method, and [b] the heuristic method
($C = 5$).

6.3 Application

I illustrate how the method aids the investigators in
achieving the long-term target of the non-routine
responses to the terrorism attacks. Let's assume that the
investigators have surveillance logs of the members of
the global mujahedin organization except Osama bin
Laden by the time of the attack. Osama bin Laden
does not appear in the logs. This is the assumption
that the investigators neither know the presence of a
wire-puller behind the attack nor recognize Osama bin
Laden at the time of the attack.

Figure 10: F measure $F(D_r)$ as a function of the rate
of the retrieved data to the whole data $D_r/D$ when the
statistical inference method is applied in case the node
$n_{ObL}$ in Figure 3 is the target covert node to discover.
$C = \{n_{ObL}\}$. $|C| = 1$. $|O| = 106$. The graph is
for the statistical inference method. The broken lines
indicate the theoretical limit and the random retrieval.
The vertical solid line indicates the position where
$D_r = D_t$.

The situation is simulated computationally like the
problems addressed in Section 6.2. In this case, the node $n_{ObL}$
in Figure 3 is the target covert node to discover,
$C = \{n_{ObL}\}$. Figure 10 shows $F(D_r)$ as a function of the
rate of the retrieved data to the whole data $D_r/D$ when
the statistical inference method is applied. The result
is close to the theoretical limit. The most suspicious
surveillance log $d_{\sigma(1)}$ includes all and only the neighbor
nodes $n_{CS1}$, $n_{CS2}$, $n_{CS6}$, $n_{CS7}$, $n_{CS9}$, $n_{CS11}$, $n_{CS12}$, and
$n_{CS14}$. This encourages the investigators to take an
action to investigate an unknown wire-puller near these
8 neighbors; the most suspicious close associates. The
investigators will decide to collect more detailed
information on the suspicious neighbors. It may result in
approaching and finally capturing the covert
wire-puller responsible for the attack.

The method, however, fails to identify two suspicious
records $\delta_{fl1} = \{n_{ObL}, n_{CS11}\}$ and $\delta_{fl2} = \{n_{ObL}, n_{CS12}\}$.
These nodes have a small nodal degree; $K(n_{CS11}) = 1$
and $K(n_{CS12}) = 1$. This shows that the surveillance logs
on the nodes having a small nodal degree do not provide
the investigators with many clues for the covert nodes.

7 Conclusion 

In this paper, I define the node discovery problem for a
social network and present methods to solve the
problem. The statistical inference method employs the
maximal likelihood estimation to infer the topology of the
network, and applies an anomaly detection technique to
retrieve the suspicious surveillance logs which are not
likely to realize without the covert nodes. The
precision, recall, and F measure characteristics are close
to the theoretical limit for the discovery of the covert
nodes in computationally synthesized networks and a
real clandestine organization. In the investigation of a
clandestine organization, the method aids the
investigators in identifying the close associates and approaching
a covert leader or a critical conspirator.

The node discovery problem is encountered in many
areas of business and social sciences. For example, in
addition to the analysis of a clandestine organization,
the method contributes to detecting an individual
employee who transmits the influence to colleagues, but
whose catalytic role is not recognized by company
managers. Detecting such an employee may be critical in
reorganizing a company structure.

I plan to address two issues in future work. The
first issue is to extend the hub-and-spoke model for the
influence transmission. The model represents the
radial transmission from an initiating node toward
multiple responder nodes. Other types of influence
transmission are present in many real social networks. Examples
are a serial chain-shaped influence transmission model or
a tree-like influence transmission model. The second issue
is to develop a method to solve the variants of the node
discovery problem. Discovering fake nodes or spoofing
nodes is also an interesting problem to uncover the
malicious intentions of the organization. A fake node is
a person who does not exist in the organization, but
appears in the surveillance logs. A spoofing node is a
person who belongs to an organization, but appears as a
different node in the surveillance logs.